EBA3500 Data analysis with programming
Jonas Moss
BI Norwegian Business School
Department of Data Science and Analytics
The function \(f(x) = c + bx + ax^2\) is called a quadratic function or second-degree polynomial. It takes on four kinds of shapes, depending on the values of \(a\), \(b\), and \(c\): convex (\(a > 0\)), concave (\(a < 0\)), and, when \(a = 0\), an increasing or a decreasing line.
Let the true data generating model be \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon.\] This is a quadratic regression model. It can be regarded as a kind of non-linear regression model, but it usually isn't treated as one.
Define \(X_1 = X\) and \(X_2 = X^2\). Then \[\beta_0 + \beta_1X + \beta_2X^2 = \beta_0 + \beta_1X_1 + \beta_2X_2,\] which implies that the quadratic regression model is actually a multiple linear regression model!
Why do we care? Because linear regression is much easier to compute than non-linear regression!
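A minimal sketch with simulated data (the variable names and coefficients are mine): generate from a known quadratic and recover the coefficients by ordinary least squares on the two predictors \(X_1 = X\) and \(X_2 = X^2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from Y = 1 + 2*X - 0.5*X^2 + noise.
n = 500
x = rng.uniform(-3, 3, n)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.1, n)

# Design matrix [1, X, X^2]: plain multiple linear regression.
X = np.column_stack([np.ones(n), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1, 2, -0.5]
```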
Five studies examined the relationship between talent and team performance. Two survey studies found that people believe there is a linear and nearly monotonic relationship between talent and performance: Participants expected that more talent improves performance and that this relationship never turns negative. However, building off research on status conflicts, we predicted that talent facilitates performance—but only up to a point, after which the benefits of more talent decrease and eventually become detrimental as intrateam coordination suffers. (Swaab et al., 2014)
So the authors claim there is no increasing relationship between talent and performance at the top level. That seems plausible considering, e.g., Martin Ødegaard!
They did five studies, as is common in psychology, and we will look at one of the football studies. Have a look at the paper for more details.
import numpy as np
import pandas as pd
import seaborn as sns

url = "https://gist.githubusercontent.com/JonasMoss/ae5436bf951da5b3e723ce6fec39e77f/raw/03148a170130686d95f020b81e27bc14b35705ff/talent.csv"
talent = pd.read_csv(url)
sns.lmplot(x = "talent", y = "points", data = talent)
The data is evidently not linear. So let’s try the logarithmic transform.
talent["log_talent"] = np.log(talent["talent"] + 1)
sns.lmplot(x = "log_talent", y = "points", data = talent)
This looks quite linear!
import statsmodels.formula.api as smf
fit = smf.ols(formula = 'points ~ np.log(talent + 1)', data=talent).fit()
print(fit.summary())

OLS Regression Results
==============================================================================
Dep. Variable: points R-squared: 0.566
Model: OLS Adj. R-squared: 0.564
Method: Least Squares F-statistic: 268.8
Date: Fri, 18 Nov 2022 Prob (F-statistic): 3.30e-39
Time: 07:38:41 Log-Likelihood: -1408.4
No. Observations: 208 AIC: 2821.
Df Residuals: 206 BIC: 2828.
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 273.8867 16.311 16.791 0.000 241.729 306.045
np.log(talent + 1) 247.0227 15.068 16.394 0.000 217.315 276.730
==============================================================================
Omnibus: 25.340 Durbin-Watson: 0.952
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.640
Skew: 0.874 Prob(JB): 2.22e-07
Kurtosis: 3.692 Cond. No. 1.60
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The fitted function is an inverted U: it appears that points first increase in talent, then decrease. At least when you look at the fitted curve!
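The quadratic summary below uses the formula term I(talent ** 2); here is a self-contained sketch on simulated data (the data-generating numbers are made up) showing how I() makes the squaring work inside a formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
demo = pd.DataFrame({"talent": rng.uniform(0, 60, 200)})
demo["points"] = (300 + 55 * demo["talent"]
                  - 0.57 * demo["talent"] ** 2
                  + rng.normal(0, 50, 200))

# I(...) tells the formula parser to square talent arithmetically,
# instead of interpreting ** as formula-language syntax.
quad_fit = smf.ols("points ~ talent + I(talent ** 2)", data=demo).fit()
print(quad_fit.params)
```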
OLS Regression Results
==============================================================================
Dep. Variable: points R-squared: 0.464
Model: OLS Adj. R-squared: 0.459
Method: Least Squares F-statistic: 88.87
Date: Fri, 18 Nov 2022 Prob (F-statistic): 1.61e-28
Time: 07:38:41 Log-Likelihood: -1430.3
No. Observations: 208 AIC: 2867.
Df Residuals: 205 BIC: 2877.
Df Model: 2
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 305.3440 17.627 17.323 0.000 270.591 340.097
talent 54.8979 5.469 10.039 0.000 44.116 65.680
I(talent ** 2) -0.5702 0.075 -7.604 0.000 -0.718 -0.422
==============================================================================
Omnibus: 18.548 Durbin-Watson: 0.654
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.285
Skew: 0.781 Prob(JB): 2.39e-05
Kurtosis: 3.128 Cond. No. 988.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
How can you model non-linear relationships? Not all functions are closely approximated by linear functions, e.g., np.sin(2 * np.pi * x) * x * np.exp(x).
You can use the np.vander function to construct design matrices consisting of polynomial terms.

[(5, 770.3176318449504),
(6, 765.6739087918303),
(7, 722.3531744500278),
(8, 631.4232378065403),
(9, 611.365210202308),
(10, 570.3460457476948),
(11, 516.8364848836604),
(12, 518.5146827024718),
(13, 516.993964244756),
(14, 518.9040946323105),
(15, 520.2862547758479)]
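A short sketch of how np.vander builds such a design matrix. Note the column order: by default the highest power comes first, so increasing=True is handy for regression.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Columns are x^0, x^1, x^2 when increasing=True;
# the default order puts the highest power first.
X = np.vander(x, N=3, increasing=True)
print(X)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```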
Another option is splines.
They are piecewise polynomials glued together.
Just like polynomials, any continuous function can be approximated arbitrarily well by splines if the degrees of freedom are high enough!
You can create spline bases using the bs function from patsy, e.g., y ~ bs(x, df = 5, degree = 3). The degree corresponds to the degree of the piecewise polynomials, with three (cubic splines) being the most common choice.
Details are not on the curriculum – but you should know that splines > polynomials for modeling of arbitrary shapes!
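A self-contained sketch (simulated data, my variable names) of fitting a cubic spline through the formula interface; patsy's bs is available inside formulas without an explicit import.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) * x * np.exp(x) + rng.normal(0, 0.1, 200)
data = pd.DataFrame({"x": x, "y": y})

# bs() builds the cubic spline basis (df = 5 basis columns) inside the formula.
fit = smf.ols("y ~ bs(x, df = 5, degree = 3)", data=data).fit()
print(fit.rsquared)
```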
Take a look at this video for more information.
We have already modelled this:
Define a fitter function for the desired degrees of freedom:
Optimization terminated successfully.
Current function value: 0.458864
Iterations 10
Logit Regression Results
==============================================================================
Dep. Variable: z No. Observations: 100
Model: Logit Df Residuals: 94
Method: MLE Df Model: 5
Date: Fri, 18 Nov 2022 Pseudo R-squ.: 0.3372
Time: 07:38:42 Log-Likelihood: -45.886
converged: True LL-Null: -69.235
Covariance Type: nonrobust LLR p-value: 6.551e-09
============================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 7.5252 3.288 2.289 0.022 1.081 13.969
bs(x, degree=3, df=5)[0] -12.2394 4.993 -2.451 0.014 -22.026 -2.453
bs(x, degree=3, df=5)[1] -2.2668 2.716 -0.835 0.404 -7.590 3.057
bs(x, degree=3, df=5)[2] -15.3265 4.839 -3.168 0.002 -24.810 -5.843
bs(x, degree=3, df=5)[3] 8.0071 4.657 1.719 0.086 -1.121 17.135
bs(x, degree=3, df=5)[4] -65.6163 20.183 -3.251 0.001 -105.174 -26.058
============================================================================================
Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
def fit_model(df):
    formula = f"y ~ bs(x, degree = 3, df = {df})"
    return smf.ols(formula, data = pd.DataFrame({"x": x, "y": y})).fit()
[(df, fit_model(df).aic) for df in range(3, 16)]

[(3, 769.919848597824),
(4, 768.489935931196),
(5, 770.497611464271),
(6, 733.230555729813),
(7, 585.9675275398621),
(8, 573.4227183375275),
(9, 575.0085899189091),
(10, 543.2839496203144),
(11, 519.8409155393351),
(12, 517.3087684885444),
(13, 513.5928894005049),
(14, 515.7660420431214),
(15, 517.8746224632054)]
Let \(x_1, x_2, x_3, \ldots, x_n\) be iid Bernoulli with some unknown \(p\).
We can estimate \(p\) using \(\overline{x}\). Its variance is \(p(1-p)/n\).
But what happens if we need to estimate the odds \(p/(1-p)\) instead of \(p\)? What’s the variance then?
Let \(g\) be a differentiable function and
\[ \sqrt{n}(\hat{\theta}_n -\theta)\stackrel{d}{\to}N(0,\sigma^2) \]
where \(\sigma^2\) is the asymptotic variance. Then
\[ \sqrt{n}(g(\hat{\theta}_n) -g(\theta))\stackrel{d}{\to}N(0,[g'(\theta)]^2\sigma^2) \]
In our case \(g(p)=p/(1-p)\).
Its derivative is \(1/(1-p)^2\). (Found using the quotient rule.)
Recall that the asymptotic variance of \(\overline{x}\) (i.e., variance of \(\sqrt{n}(\overline{x}-p)\) ) is \(\sigma^2 = p(1-p)\).
Thus \(g'(p)^2\sigma^2=p(1-p)/(1-p)^4=p/(1-p)^3\).
Hence the variance of the estimated odds \(\overline{x}/(1-\overline{x})\) is approximately \(p/[(1-p)^{3}n]\).
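A quick simulation sketch (the values of \(p\) and \(n\) are my choices) checking the delta method variance \(p/[(1-p)^{3}n]\) for the odds:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 0.3, 1000

# Estimate the odds p/(1-p) in each of 10_000 simulated samples.
x_bar = rng.binomial(n, p, size=10_000) / n
odds_hat = x_bar / (1 - x_bar)

delta_var = p / ((1 - p) ** 3 * n)   # delta method prediction
print(odds_hat.var(), delta_var)     # should be close
```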
Suppose \(X\) has variance \(\sigma^2\) and unknown mean \(\mu\). How can we estimate \(\mu^2\) and what is its asymptotic variance?
Suppose \(\hat{\beta}_1\) is a regression coefficient from the logistic model. What is the asymptotic variance of \(\exp(\hat{\beta}_1)\)?
Answer: Recall that the entire parameter vector \(\hat{\beta}\) has covariance matrix equal to the inverse of the Fisher information, \(I^{-1}\), hence the variance of \(\hat{\beta}_i\) is the \(i\)th diagonal element of \(I^{-1}\). Now apply the delta method to \(g(x)=\exp(x)\), yielding \((g'(x))^2 = \exp(2x)\). Hence the asymptotic variance of \(\exp(\hat{\beta}_1)\) is \(\exp(2\beta_1)I^{-1}_{11}\).
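A sketch of the computation on simulated data (the data-generating coefficients are mine). In statsmodels, cov_params() already estimates the covariance matrix of \(\hat{\beta}\), so the delta method variance of the odds ratio is \(\exp(2\hat{\beta}_1)\) times the relevant diagonal element.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
prob = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))
y = rng.binomial(1, prob)

fit = smf.logit("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit(disp=0)

beta_1 = fit.params["x"]
var_beta_1 = fit.cov_params().loc["x", "x"]

# Delta method: Var(exp(b1)) ~= exp(2 * b1) * Var(b1).
var_odds_ratio = np.exp(2 * beta_1) * var_beta_1
print(np.exp(beta_1), np.sqrt(var_odds_ratio))
```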
Recall that the gradient of function \(g:\mathbb{R}^k\to\mathbb{R}\) is \[\nabla g(\theta)=\left[\begin{array}{cccc} \frac{\partial g}{\partial\theta_{1}} & \frac{\partial g}{\partial\theta_{2}} & \cdots & \frac{\partial g}{\partial\theta_{k}}\end{array}\right]\]
Example. Consider the function \[g(\theta)=\theta_{1}^{2}+\theta_{2}^{2}+\theta_{3}^{2}\]
The gradient is \[\nabla g(\theta)=\left[\begin{array}{c} \frac{\partial g}{\partial\theta_{1}}\\ \frac{\partial g}{\partial\theta_{2}}\\ \frac{\partial g}{\partial\theta_{3}} \end{array}\right]=2\left[\begin{array}{c} \theta_{1}\\ \theta_{2}\\ \theta_{3} \end{array}\right]\]
Assume that \(\theta\) is a vector and \[\sqrt{n}(\hat{\theta}-\theta)\stackrel{d}{\to}N(0,\Sigma).\] The matrix \(\Sigma\) is the asymptotic covariance matrix.
\[\sqrt{n}(g(\hat{\theta})-g(\theta))\stackrel{d}{\to}N(0,\nabla g(\theta)^{T}\Sigma\nabla g(\theta))\]
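A numerical sketch of the sandwich \(\nabla g(\theta)^{T}\Sigma\nabla g(\theta)\) for the example \(g(\theta)=\theta_{1}^{2}+\theta_{2}^{2}+\theta_{3}^{2}\); the \(\Sigma\) here is made up for illustration.

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])  # an assumed asymptotic covariance matrix

grad = 2 * theta  # gradient of theta_1^2 + theta_2^2 + theta_3^2

# Asymptotic variance of g(theta_hat) by the multivariate delta method.
avar = grad @ Sigma @ grad
print(avar)
```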
You can extract the estimated covariance matrix of \(\hat{\beta}\) with cov_params(). Since \(\Sigma\) is the asymptotic covariance matrix, multiply by \(n\): n * model.cov_params().

Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: admit No. Observations: 400
Model: Logit Df Residuals: 394
Method: MLE Df Model: 5
Date: Fri, 18 Nov 2022 Pseudo R-squ.: 0.08292
Time: 07:38:44 Log-Likelihood: -229.26
converged: True LL-Null: -249.99
Covariance Type: nonrobust LLR p-value: 7.578e-08
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -3.9900 1.140 -3.500 0.000 -6.224 -1.756
C(rank)[T.2] -0.6754 0.316 -2.134 0.033 -1.296 -0.055
C(rank)[T.3] -1.3402 0.345 -3.881 0.000 -2.017 -0.663
C(rank)[T.4] -1.5515 0.418 -3.713 0.000 -2.370 -0.733
gre 0.0023 0.001 2.070 0.038 0.000 0.004
gpa 0.8040 0.332 2.423 0.015 0.154 1.454
================================================================================
cov_params()

              Intercept  C(rank)[T.2]  C(rank)[T.3]  C(rank)[T.4]       gre       gpa
Intercept      1.299488     -0.084476     -0.048644     -0.089431 -0.000301 -0.303660
C(rank)[T.2]  -0.084476      0.100166      0.069566      0.070127 -0.000002  0.004521
C(rank)[T.3]  -0.048644      0.069566      0.119237      0.069742  0.000019 -0.009469
C(rank)[T.4]  -0.089431      0.070127      0.069742      0.174583  0.000012  0.003568
gre           -0.000301     -0.000002      0.000019      0.000012  0.000001 -0.000124
gpa           -0.303660      0.004521     -0.009469      0.003568 -0.000124  0.110104